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Abstract 

We derive an exact and efficient Bayesian regression algorithm for piecewise 
constant functions of unknown segment number, boundary location, and lev- 
els. It works for any noise and segment level prior, e.g. Cauchy which can 
handle outliers. We derive simple but good estimates for the in-segment 
variance. We also propose a Bayesian regression curve as a better way of 
smoothing data without blurring boundaries. The Bayesian approach also 
allows straightforward determination of the evidence, break probabilities and 
error estimates, useful for model selection and significance and robustness 
studies. We discuss the performance on synthetic and real-world examples. 
Many possible extensions will be discussed. 
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1 Introduction 



We consider the problem of fitting a piecewise constant function through noisy one- 
dimensional data, as e.g. in Figure 1, where the segment number, boundaries and 
levels are unknown. Regression with piecewise constant (PC) functions, also known 
as change point detection, has many applications. For instance, determining DNA 
copy numbers in cancer cells from micro-array data, to mention just one recent. 

Bayesian piecewise constant regression (BPCR). Wc provide a full Bayesian 
analysis of PC-regression. For a fixed number of segments we choose a uniform 
prior over all possible segment boundary locations. Some prior on the segment lev- 
els and data noise within each segment is assumed. Finally a prior over the number 
of segments is chosen. From this we obtain the posterior segmentation probability 
distribution (Section 2). In practice we need summaries of this complicated distri- 
bution. A simple maximum (MAP) approximation or mean does not work here. 
The right way is to proceed in stages from determining the most critical segment 
number, to the boundary location, and finally to the then trivial segment levels. We 
also extract the evidence, the boundary probability distribution, and an interesting 
non-PC regression curve including error estimate (Section 3). We derive an exact 
polynomial-time dynamic-programming-type algorithm for all quantities of interest 
(Sections 5 and 8). Our algorithm works for any noise and level prior. We consider 
more closely the Gaussian "standard" prior and heavy-tailed robust-to-outhers dis- 
tributions like the Cauchy, and briefly discuss the non-parametric case (Sections 4 
and 6). Finally, some hyper-parameters like the global data average and variabil- 
ity and local within-level noise have to be determined. We introduce and discuss 
efficient semi-principled estimators, thereby avoiding problematic or expensive nu- 
merical EM or Monte-Carlo estimates (Section 7). We test our method on some 
synthetic examples (Section 9) and some real- world data sets (Section 10). The 
simulations show that our method handles difficult data with high noise and out- 
liers well. Our basic algorithm can (easily) be modified in a variety of ways: For 
discrete segment levels, segment dependent variance, piecewise linear and non-linear 
regression, non-parametric noise prior, etc. (Section 11). 

Comparison to other work. Sen and Srivastava [SS75] developed a frequentist 
solution to the problem of detecting a single (the most prominent) segment boundary 
(called change or break point). Olshen et al. [OVLW04] generalize this method to 
detect pairs of break points, which improves recognition of short segments. Both 
methods arc then (heuristically) used to recursively determine further change points. 
Another approach is penalized Maximum Likelihood (ML). For a fixed number of 
segments, ML chooses the boundary locations that maximize the data likelihood 
(minimize the mean square data deviation). Jong et al. [Jon03] use a population 
based algorithm as minimizer, while Picard et al. [Pic05] use dynamic programming, 
which is structurally very close to our core recursion, to find the exact solution in 
polynomial time. An additional penalty term has to be added to the likelihood in 
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order to determine the correct number of segments. The most principled penalty is 
the Bayesian Information Criterion [Sch78, KW95]. Since it can be biased towards 
too simple [Wea99] or too complex [Pic05] models, in practice often a heuristic 
penalty is used. An interesting heuristic, based on the curvature of the log-likelihood 
as a function of the number of segments, has been used in [Pic05]. Our Bayesian 
regressor is a natural response to penalized ML. Many other regressors exist; too 
numerous to hst them all. Another closely related work to ours is Bayesian bin 
density estimation by Endres and Foldiak [EF05] , who also average over all boundary 
locations, but in the context of density estimation. 

Advantages of Bayesian regression. A full Bayesian approach (when computa- 
tionally feasible) has various advantages over others: A generic advantage is that it 
is more principled and hence involves fewer heuristic design choices. This is particu- 
larly important for estimating the number of segments. Another generic advantage 
is that it can be easily embedded in a larger framework. For instance, one can decide 
among competing models solely based on the (Bayesian) evidence. Finally, Bayes 
often works well in practice, and provably so if the model assumptions are valid. ^ 
We can also extract other information (nearly for free), like probability estimates 
and variances for the various quantities of interest. Particularly interesting is the 
expected level (and variance) of each data point. This leads to a regression curve, 
which is very flat, i.e. smoothes the data, in long and clear segments, wiggles in 
less clear segments, follows trends, and jumps at the segment boundaries. It thus 
behaves somewhat between local smoothing (which wiggles more and blurs jumps) 
and rigid PC-segmentation. 

2 The General Model 

Setup. We are given a sequence y— (yi,...,y„), e.g. times-series data or measure- 
ments of some function at locations l...n, where each yi G M resulted from a noisy 
"measurement"., i.e. we assume that the yi are independently (e.g. Gaussian) dis- 
tributed with means /^^ and^ variances a'^. The data likelihood is therefore^ 

n 

likelihood: P{y\l^',cr') := l[P{yM,<) (1) 

i=l 

^Note that we are not claiming here that BPCR works better than the other mentioned ap- 
proaches. In a certain sense Baycs is optimal if the prior is 'true'. Practical superiority likely 
depends on the type of application. A comparison for micro-array data is in progress [KH06]. 
The major aim of this paper is to derive an efficient algorithm, and demonstrate the gains of 
BPCR beyond bare PC-regression, e.g. the (predictive) regression curve (which is better than local 
smoothing which wiggles more and blurs jumps). 

^More generally, /z- and ct- are location and scale parameters of a symmetric distribution. 

^For notational and verbal simplicity we will not distinguish between probabilities of discrete 
variables and densities of continuous variables. 
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The estimation of the true underlying function /= (/i,...,/„) is called regression. 
We assume or model / as piecewise constant. Consider k segments with segment 
boundaries — to<ti<...<tk-i<tk — n, i.e. / is constant on {tq-i + l,..,tq} for each 
0<q<k. If the noise within each segment is the same, we have 

piecewise constant: fi[ = fig and = cXg for tg^i < i <tg \/q (2) 

We first consider the case in which the variances of all segments coincide, i.e. aq = a 
\fq. Our goal is to estimate the segment levels fi = {fii,...,fik), boundaries i= (to, •••i^fc)) 
and their number k. Bayesian regression proceeds in assuming a prior for these 
quantities of interest. We model the segment levels by a broad (e.g. Gaussian) 
distribution with mean u and variance p^. For the segment boundaries we take 
some (e.g. uniform) distribution among all segmentations into k segments. Finally 
we take some prior (e.g. uniform) over the segment number k. So our prior P{/j,,t,k) 
is the product of 

prior: P{nq\v,p)\/q and P{t\k) and P{k) (3) 

We regard the global variance p^ and mean u oi fi and the in- segment variance cr^ 
as fixed hyper-parameters, and notationally suppress them in the following. We will 
return to their determination in Section 7. 

Evidence and posterior. Given the prior and hkelihood we can compute the data 
evidence and posterior P{y\iJL,t,k) by Bayes' rule: 

avaance: Pi,y) = ^ /p(„|^,t,*)P(M,M)dM 

k,t 

. ■ * ; I ^ P{y\/^,'t,k)P{n,t,k) 
posterior: P{n,t,k\y) = -— 

P{y) 

The posterior contains all information of interest, but is a complex object for prac- 
tical use. So we need summaries like the maximum (MAP) or mean and variances. 
MAP over continuous parameters (yit) is problematic, since it is not reparametriza- 
tion invariant. This is particularly dangerous if MAP is across different dimensions 
{k), since then even a linear transformation (fi^afi) scales the posterior (density) 
exponentially in k (by a''). This severely infiuences the maximum over k, i.e. the 
estimated number of segments. The mean of fj, does not have this problem. On 
the other hand, the mean of t makes only sense for fixed (e.g. MAP) k. The most 
natural solution is to proceed in stages similar to as the prior (3) has been formed. 



3 Quantities of Interest 

We now define estimators for all quantities of interest in stages as suggested in 
Section 2. 
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Quantities of interest. Our first quantities are the posterior of tlie number of 
segments and the MAP segment number 

# segments: P{k\y) and k = argmaxP(A;|j/) 

k 

Second, for each boundary tg its posterior and MAP, given the MAP estimate of k 

boundaries: P{tq\y,k) and ig = aigma.xP{tg\y,k) 

tq 

Different estimates of tg (e.g. the mean or MAP based on the joint t posterior) will be 
discussed later. Finally we want the segment level means for the MAP segmentation 

segment level: P{fj.g\y,i,k) and ^'^ ~ J ^(f^<i\y^^^^)f^<i^f^i 

The estimate {ix,t,k) defines a (single) piecewise constant (PC) function /, which is 
our estimate of /. A (very) different quantity is to Bayes-average over all piecewise 
constant functions and to ask for the mean at location i as an estimate for /j. 

regression curve: and fi = jp(ri\y)^ri 

We will see that /x' behaves similar to a local smoothing of y, but without blurring 
true jumps. Standard deviations of all estimates may also be reported. 



4 Specific Models 

We now complete the specification of the data noise and prior. 

Segment boundeiries. We assume a uniform prior over all segmentations into k 
segments. Since there are (^I^) ways of placing the k — 1 inner boundaries (ordered 
and without repetition) on (l,...,n — 1), we have 

uniform boundary prior: P{t\k) = (^Zj)"^ (4) 

This is the only (additional) essential assumption to be able to derive efficient al- 
gorithms. We now discuss some (purely exemplary) choices for the data noise and 
priors on /i and k. 

Gaussian model. The standard assumption on the noise is independent Gauss: 

1 -fa^ 

Gaussian noise: P(yi|//^, (T^') = — e ^"^i^ (5) 

v27rcr' 



The corresponding standard "conjugate" prior on the means /ig for each segment q 
is also Gauss 

1 {iJ-q-vY 



Gaussian prior: P(/ig|i/, p) = e (^g) 

v27rp 



Cauchy model. The standard problem with Gauss is that it does not handle 
outliers well. If we do not want to or cannot remove outhers by hand, we have 
to properly model them as a prior with heavier tails. This can be achieved by a 
mixture of Gaussians or by a Cauchy distribution: 

1 o' 

Cauchy noise: P{y.\M - (7) 

Note that /^^ and a\ determine the location and scale of Cauchy but are not its mean 
and variance (which do not exist). The prior on the levels /Xg may as well be modeled 
as Cauchy: 

Cauchy prior: P{[iq\v,p) = — (8) 

Actually, the Gaussian noise model may well be combined with a non-Gaussian prior 
and vice versa if appropriate. 

Number of segments. Finally, consider the number of segments /c, which is an 
integer between 1 and n. Sure, if we have prior knowledge on the [minimal,maximal] 
number of segments [kmimkmax] we could/should set P{k)—Q outside this interval. 
Otherwise, any non-extreme choice of P{k) has httle influence on the final results, 
since it gets swamped by the (implicit) strong (exponential) dependence on k of the 
likelihood. So we suggest a uniform prior 

P{k) — 77— for 1 < A; < kmax and otherwise 

i^max 

with kmax — n as default (or kmax < n discussed later) . 

5 Efficient Solution 

Notation. We now derive expressions for all quantities of interest, which need 
time 0{kmax'>T'^) and space 0{n^). Throughout this and the next section we use 
the following notation: k is the total number of segments, t some data index, q 
some segment index, 1 <i < h < j <n are data item indices of segment boundaries 
to<ti<tp<tm<tk, i.e. to = 0, ti = i, tp = h, tm=j, tk = n. Further, yij = {yi+i,...,yj) is 
data with segment boundaries tim—{ti,..-,tm) and segment levels iiim — {lJ'i+i,---,IJ^m)- 
In particular yon — V, tok — t, and //qa; = A*- All introduced matrices below (capital 
symbols with indices) will be important in our algorithm. 

General recursion. For m — l+1, yij is data from a single segment with mean Hm 
whose joint distribution (given segment boundaries and m — l + 1) is 

j 

single segment: P{yij, iJ,m\tm-i,mA) = P{fJ'm) -P(yt|A*m) (9) 

t=i+l 



6 



by the model assumptions (1) and (2). The probabihties for a general but fixed 
segmentation are independent, i.e. 



P{yij,l^lm\tlm,m- I) 



(J 
(J 

(c) 



(10) 

tpm,m-p) (any p) (11) 



(12) 
(13) 



p=l+l t=tp-l+l 

= PiVih, IJ,lp\tip,p- l)P{yhj, fJ'pm 

This is our key recursion. Consider now 

Q{yij:l^lm\m - I) := (4"l*;l\)P(yij, film\th tm: m - I) 

imX-l) Y P{yij:l^lm\tlm:m-l)P{tijn\m-l) 

km ■i=tl<...<tm=j 

XI P{yij,iJ'lm\tlm,m-l) 

ilm '■ i=tl<...<tm=j 

j+p—m 

X] PiVih: l^lp\tlp:P " P{yhj, l^pm\tpm: m - p) 

tp=i+p—l tip :i=ti<...<tp=h tpm ■ h=tp<...<tm=j 

j+p—m 

= Qiyih,i^ip\p-i)Q{yhj,i^pm\m-p) (i4) 

h=i+p—l 

(a) is just an instance of formula P{A) = J2iP{^\Hi)P{Hi) for a partitioning (Hi) of 
the sample space. In (b) we exploited uniformity (4) of P(t;m|m— /) = (^~*;~\)~^ and 
hence its independence from the concrete segmentation tim- In (c) we fix segment 
boundary tp, sum over the left and right segmentations, and finally over tp. 

Left and right recursions. If we integrate (12) over nim, the integral factorizes 
and wc get a recursion in (a quantity that is proportional to) the evidence of i/ij. 
Let us define more generally r*'* "Q- moments" of /ij. 



Q'[{yij\^-i) '■= I Q{yij,fj'im\m-i)fj.'/dfj.im 



(15) 



j+p—m 



-- Q'^iyih\p-i)Qi{yhj\p-i) + Y^t^y^''\^~p'>Q^^y^'^\^~p'> 

h=i+p—l 



h=t 



Depending on whether h<t or h>t, the /i'/' term combines with the right or left 
Q in recursion (14) to Ql, while the other Q simply gets integrated to — 
independent t. The recursion terminates with 

Al^ - Ql{yij\l) = f P(M n P{yt\l^m)l^mdl^m: (0 < z < J < n) (16) 

t=i+l 

Note A'^j = P{yij\tm-i,m) is the evidence and Alj/A'^j = E[fil^\yij,tm-i,m] the r*^ mo- 
ment of = Urn in case ijij is modeled by a single segment. It is convenient to 
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formally start the recursion with Q^{yij\0) = Sij = {Q^^^~-' (consistent with the recur- 
sion) with interpretation that (only) an empty data set {i=j) can have segments. 
Since p was an arbitrary split number, we can choose it conveniently. We need a 
left recursion for r — 0, i — 0, p—l — k, and m—p—l: 

i-i i-i 
W,,- - Q\yoj\k + l) - J]g°(t/o/.|^)g°(y;.,|l) - E^'^'^^- 

h=k h=k 

That is (apart from binomial factors) the evidence of y^j with k+1 segments equals 
the evidence oiy^h with k segments times the single-segment evidence ofy^j, summed 
over all locations h of boundary k. The recursion starts with = ^oj, or more 
conveniently with Loj — SjQ. We also need a right recursion for r — 0, j — n, p — l — 1, 
m—p — k: 



n—k n—k 

40 7-) 

kh 



Rk+i,i Q\yin\k + i) = Yl Q\yMQ\yhn\k) = ^ihR' 

h=i+l h=i+l 

The recursion starts with i?i„ = A^„, or more conveniently with RQi = din. 
Quantities of interest. Note that 

Lkn = Rko = Q%y\k) = {lZl)P{y\k) 

are proportional to the data evidence for fixed k. So the data evidence can be 
computed as 

E P{y) = Y.ny\k)P{k) - 7— E tA ^^^^ 

k=i '^'""^ fc=i U-J 

The posterior of k and its MAP estimate are 

Ck := P{k\y) = — — = — TTT ^ and A; = argmaxCfe (18) 

^{y) l/t-l/ "^"^ k=l..kmax 

Segment boundaries. We now determine the segment boundaries. Consider re- 
cursion (12) for i = l~0, m — k, j — n, but keep tp — h fixed, i.e. do not sum over it. 
Then (13) and (14) reduce to the l.h.s. and r.h.s. of 

{lZl)P(y,iJ',tp\k) = Q(yoh,i^op\p)Q{yhn,i^pk\k-p) (19) 

Integration over /j, gives 

{r_l)P{y,tp\k) - Q%yok\p)Q'{yhn\k-p) 

Hence the posterior probability that boundary p is located at tp — h, given k, is 

n P(i h\ (r-l)P(y^tp\k) Lp,R^,_p,, 
Bph := P{tp^h\y,k) = ,n-i~.^, iTx = F (^0) 



:T_\)P{y\k) hn 



So our estimate for segment boundary p is 

ip := argmaxPltp^ h\y,k) = argmax{Sp^} = arg maxfLp^ji?? ^} (21) 

h h h ^' 

Segment levels. Finally we need the segment levels, given the segment number 
k and boundaries t. The r*'* moment of segment m with boundaries i — tm-i and 
j = im is 

UT = Elu'' \V i k] = Efu^ \V-- i , 11 = / PiVv^ /^m|^m-l,m, '^)lJ'\nd^m ^ 

(22) 

Note that this expression is independent of other segment boundaries and their 
number, as it should. 

Regression curve. Recursion (15) allows in principle to compute the regression 
curve E[/ij|y] by defining (Lp^)^^ and {R'^^^)ki analogous to Lkj and Rki, but this 
procedure needs 0{n^) space and 0{kmax'n^) time, one 0{n) worse than our target 
performance. We reduce probabilities of /xj to probabilities of firn- We exploit the 
fact that in every segmentation, n[ lies in some segment. Let this (unique) segment 
be m with (unique) boundaries i = tm-i<t<tm=j- Then ii[ = fi^a- Summing now 
over all such segments we get 

k t—1 n 

P{l^i\y^k) = J^^^P(/im,t^_i = i,tm = j|y,A;) (23) 

m=l i=0 j=t 

By fixing tp in (13) we arrived at (19). Similarly, dividing the data into three parts 
and fixing ti and we can derive 

{lZ\)P{y,IJ',ti,tm\k) = Q{yoi,IJ'Oi\l)Q{yijlJ'm\m-l)Q{yjnlJ>rnk\k-m) 

Setting l—m—1, integrating over //q/ and /x^^, dividing by {^Zi)P{y\k), and inserting 
into (23), we get 



1 

P{l^'t\y^k) = 1— Lm-l,iQiyij, IIm\'i-)Rk-m,j 
^kn 



m=l i<t<j 

The posterior moments of /Xj, given k, can hence be computed by 
^ 1 ^ 

i<t<j m=l 

While segment boundaries and values make sense only for fixed k (we chose fc), the 
regression curve ji[ could actually be averaged over all k instead of fixing k = k. 
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Relative log-likelihood. Another quantity of interest is how likely it is that y is 
sampled from /. The log-hkelihood of y is 

n 

II := \ogP{y\f) = \ogPiy\fi,t,k) = ^ log P{y,\fi[, a) 

i=l 

Like for the evidence, the number itself is hard to interpret. We need to know how 
many standard deviations it is away from its mean(=entropy). Since noise (1) is 
i.i.d., mean and variance of // are just n times the mean and variance of the log- noise 
distribution of a single data item. For Gaussian and Cauchy noise we get 

Gauss: E[ll\f] ^ ^\og(27rea^), Var[ZZ|/] = f 
Cauchy: E[ll\f] =nlog(47ra), Var[ZZ|/] = fTr^ 



6 Computing the Single Segment Distribution 

We now determine (at least in the Gaussian case efficient) expressions for the mo- 
ments (16) of the distribution (9) of a single segment. 

Gaussian model. For Gaussian noise (5) and prior (6) we get 



27ra / v27rp 

where d = j — i. This is an unnormalized Gaussian integral with the following nor- 
malization, mean, and variance [Bol04, Sec. 10.2]: 

"^-''"^^ A,- - (27r(72)<^/2(l + dp2/^2)l/2 l^^^ 

EK|,„t._,.] = § = '^^^SVaf'^ " ^5:^* (26) 

y " t 

where runs from i + 1 to j. The mean/ variance is just the weighted average of 
the mean/ variance of yij and /im- One may prefer to use the segment prior only 
for determining A^j, but use the unbiased estimators (~) for the moments. Higher 
moments A^j can also be computed from the central moments 

for even r, and for odd r. 
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Other models. Analytic expressions for A^j are possible for all distributions in 
the exponential family. For others like Cauchy we need to perform integral (16) 
numerically. A very simple approximation is to replace the integral by a sum on a 
uniform grid: The stepsizc/range of the grid should be some fraction/multiple of the 
typical scale of the integrand, and the center of the grid should be around the mean. 
A crude estimate of the mean and scale can be obtained from the Gaussian model 
(26) and (27). Or even simpler, use the estimated global mean and variance (28), 
and in-segment variance (29) for determining the range (e.g. [i> — 25p,...,i> + 25p]) 
and stepsize (e.g. a/10) of one grid used for all A^j. Note that if yij really stem 
from one segment, the integrand is typically imimodal and the above estimates for 
stepsize and range arc reasonable, hence the approximation will be good. If i/ij 
ranges over different segments, the discretization may be crude, but since in this 
case, A^j is (very) small, crude estimates are sufficient. Note also that even for the 
heavy-tailed Cauchy distribution, the first and second moments Alj and A'^j exist, 
since the integrand is a product of at least two Cauchy distributions, one prior and 
one noise for each t/j. Preferably, standard numerical integration routines (which 
are faster, more robust and more accurate) should be used. 



7 Determination of the Hyper-Parameters 

Hyper-Bayes and Hyper-ML. The developed regression model still contains 
three (hyper)parameters, the global variance and mean v of /x, and the in-segment 
variance a^. If they are not known, a proper Baycsian treatment would be to assume 
a hyper-prior over them and integrate them out. Since we do not expect a signif- 
icant influence of the hyper-prior (as long as chosen reasonable) on the quantities 
of interest, one could more easy proceed in an empirical Bayesian way and choose 
the parameters such that the evidence P{y\a,u,p) is maximized ( "hyper-ML" ) . (We 
restored the till now omitted dependency on the hyper-parameters). 

Exhaustive (grid) search for the hyper-ML parameters is expensive. For data 
which is indeed noisy piecewise constant, P{y\a^v^p) is typically unimodal^ in {a^v,p) 
and the global maximum can be found more efficiently by greed hill-climbing, but 
even this may cost a factor of 10 to 1000 in efficiency. Below we present a very 
simple and excellent heuristic for choosing {a,v,p). 

Estimate of global mean and variance v and p. A reasonable choice for the 
level mean and variance u and p are the empirical global mean and variance of the 
data y. 

^ n 1 

^> ~ -y"yt and p^ ^ -rY^iVt-i^f (28) 

f=i t=i 

little care is necessary with the in-segment variance a^. If we set it (extremely close) to 
zero, all segments will consist of a single data point with (close to) infinite evidence (see e.g. 
(25)). Assuming k^ax <n eliminates this unwished maximum. Greedy hill-climbing with proper 
initialization will also not be fooled. 
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This overestimates the variance of the segment levels, since the expression also 
includes the in-segment variance cr^, which one may want to subtract from this 
expression. 

Estimate of in-segment veiriance cr^. At first there seems little hope of esti- 
mating the in-segment variance o"^ from y without knowing the segmentation, but 
actually we can use a simple trick. If y would belong to a single segment, i.e. the yt 
were i.i.d. with variance cr^, then the following expressions for cr^ would hold: 



1 



^ n— 1 



i.e. instead of estimating cr^ by the squared deviation of the yt from their mean, 
we can also estimate cr^ from the average squared difference of successive yt- This 
remains true even for multiple segments if we exclude the segment boundaries in 
the sum. On the other hand, if the number of segment boundaries is small, the 
error from including the boundaries will be small, i.e. the second expression remains 
approximately valid. More precisely, we have within a segment and at the boundaries 



E ^ (l/t+i-|/t)^ = 2(t^-t^_i-l)a^ and Y>{yt^+i-yt^f = 2a'^ + {p^+i- p^f 



t — tn 



1+1 



Summing over all k segments and boundaries and solving w.r.t. we get 

■n-1 -| k-1 

^{yt+1 - ytf - ^(/im+l - fimf 



1 



2(n- 1) 
1 

2(n- 1) 



E 



E 



t=i 

n-1 



m=l 



t=i -I 



The last expression holds, since there are k boundaries in n data items, and the 
ratio between the variance of fj, to the in-segment variance is p^/cr^- Hence we may 
estimate cr^ by the upper bound 



^ n— 1 



2(n-l)^ 



(29) 



If there are not too many segments {k<^n) and the regression problem is hard (high 
noise p^cr), this is a very good estimate. In case of low noise (p^a), regression 
is very easy, and a crude estimate of is sufficient. If there are many segments, 

tends to overestimate a^, resulting in a (marginal) bias towards estimating fewer 
segments (which is then often welcome) . 

If the estimate is really not sufficient, one may use (29) as an initial estimate 
for determining an initial segmentation t, which then can be used to compute an 
improved estimate of a^, and possibly iterate. 
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Hyper-ML estimates. Expressions (28) are tlie standard estimates of mean and 
variance of a distribution. They are particularly suitable for (close to) Gaussian 
distributions, but also for others, as long as u and p parameterize mean and variance. 

If mean and variance do not exist or the distribution is quite heavy-tailed, we need 
other estimates. The "ideal" hyper-ML estimates may be approximated as follows. 
If we assume that each data point lies in its own segment, we get 

n 

(i'jP) ~ argmax JJ^ P(|/f|a, I/, p) with 



t=i 



P{yt\cr,iy,p) = J P{yt\p,cr)P{p\iy,p)dp (30) 



The in-segmcnt variance o"^ can be estimated similarly to the last paragraph con- 
sidering data differences and ignoring segment boundaries: 



n-l 



argmax JJ^ P{yt+i - ytW) with 



t=l 



/oo 
P{yt+i = a + A|//, (T)P{yt = a\p, a)da (31) 
-oo 

Note that the last expression is independent of the segment level (this was the whole 
reason for considering data differences) and exact iff yt and yt+i belong to the same 
segment. In general (beyond the exponential family) {v^p^a) can only be determined 
numerically. 

Using median and queirtile. We present some simpler estimates based on median 
and quartiles. Let [y] be the data vector y, but sorted in ascending order. Then, 
item [y]an (where the index is assumed to be rounded up to the next integer) is the 
a-quantile of empirical distribution y. In particular [y]n/2 is the median of y. It is 
a consistent (and robust to outliers) estimator of the mean segment level 



V 



[y\n/2 (32) 



if noise and segment levels have symmetric distributions. Further, half of the data 
points lie in the interval [a, 6], where a:= [t/]n/4 is the first and b:= [2/]3n/4 is the last 
quartile of y. So, using (30), p should be estimated such that 



I 



P{a <yt< h\a,v,p) ^ \ 
Ignoring data noise (assuming (j~0), we get 

p ^ ~ ^-^j^ a = 1 for Cauchy and a = 0.6744 for Gauss, (33) 

2a 
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where a is the quartile of the standard Cauchy/Gauss/other segment prior. For the 
data noise a we again consider the differences Aj :=|/t_|_i— Using (31), d should 
be estimated such that 

where a'=[A]„/4 and 6'= [A] 3^/4 —a'. One can show that 
^ _ [A]3n/4 - [A]„/4 ^.^^ /3 ^ 2 for Cauchy and /3 = 0.6744^2 for Gauss, 

(34) 

where /9 is the quartile of the one time with itself convolved standard 
Cauchy/Gauss/other (noise) distribution. Use of quartiles for estimating a is ro- 
bust to the "outliers" caused by the segment boundaries, so yields better estimates 
than (29) if noise is low. Again, if the estimates are really not sufficient, one may 
iteratively improve them. 



8 The Algorithm 

The computation of A. L. R, E, C, B, ip, /xj^, F, and IJ,'/ by the formulas/recursions 
derived in Section 5, are straightforward. In (16) one should compute the product, 
or in (25), (26), (27) the sum, incrementally from j'^j' + l. Similarly n'/ should be 
computed incrementally by 

t-l n 
i=0 j=t+l 

Typically r = 0,1,2. In this way, all quantities can be computed in time 0{kmax'n^) 
and space O(n^). Space can be reduced to 0{kjnaxn) by computing A on-the-fly 
in the various expressions at the cost of a slowdown by a constant factor. Table 1 
contains the algorithm in pscudo-C code. The complete code including examples 
and data is available at [HutOSa]. Since A^, L, R, and E can be exponentially large 
in n, i.e. huge or tiny, actually their logarithm has to be computed and stored. In the 
expressions, the logarithm is pulled in by log(x-y) =log(a;) +log(y) and \og{x+y) — 
log(x)-|-log(l-|-exp(log(y)— log(x)) ior x>y and similarly for x<y. Instead of A^j 
we have to compute A^JA^j by pulling the denominator into the integral. 



9 Synthetic Examples 

Description. In order to test our algorithm we created various synthetic data sets. 
We considered picccwisc constant functions with noisy observations. The considered 
function was defined —1 in its first quarter, +1 in its second quarter, and in the 
last half. So the function consists of two small and one large segments, with a large 
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Table 1: Regression algorithm in pseudo C code 

EstGauss(7/,n) and EstGeneral(y,n,Q:,/3) compute from data (yi,. ..,?/„), esti- 
mates for u, p, a (hat omitted), and from that the evidence A^j of a single seg- 
ment ranging from i + 1 to j, and corresponding first and second moments Ajj and 
A'^j. The expressions (28), (29), (25), (26), (27) are used in EstGauss() for Gaussian 
noise and prior, and (32), (33), (34) and numerical integration on a uniform Grid in 
EstGeneral() for arbitrary noise and prior P, e.g. Cauchy. [y] denotes the sorted y 
array. Grid is the uniform integration grid, and are additive/multiplicative 
updates, and o denotes arrays. 



EstGauss(2/,n) 



= 2{^) TJU ivt+i -ytf] 

for(i = 0..n) 
[ m = 0; s = 0; 
for(_7 =i-|-l..n) 



\2. 



-yt, 



2. 



</-(T-//;- 



(27ra2)'^/2(l + (ipV(72)V2' 

m/d); 

2 I ^2 



L return (A|jy,i/,p,o-); 



Regression(A,n,fcma3;) takes A, n, 
and an upper bound on the number 
of segments kmax, and computes the 
evidence E—P{y) (17), the probabil- 
ity Cfc=P(A;|2/) of A; segments and its 
MAP estimate k (18), the probability 
Bi = P(3p:tp = i\y,k) that a bound- 
ary is at i (20) and the MAP location 
tp of the p*'* boundary (21), the first 
and second segment level moments 
fip and fip of all segments p (22), and 
the Bayesian regression curve //^ and 
its second moment /^^^ (24). 



EstGeneral(i/,n,Q:,/3) 

r ^=[y]n/2; 
p={[yhn/4-[y]n/4)/'2a; 

for(t = l..n — 1) At = yt^i 
a=([A]3„/4-[A]„/4)/2/3; 
Grid= (§^)n[i/-25p,i/+25p]; 
for(i = 0..n) 

[ for(/iGGrid) R^ = P{i^\u,p); 
for(j = i+ l..n) 

[ for(/iGGrid) R^* = P{yj\p,a); 
[ return (A|jjj,i/,p,o-); 



0,1,2) 



Regression(A|jjj,n,fc^ 



h=i+l^ih-^kh, 



\ for(i = 0..n) { Loi- 
for(A; = 0..n-l) 

[ for(i = 0..n) Lk+l,^ 
[ for(i = 0..n) Rk+i,: 

T7I 7,-1 X^kmax T //n-l\. 

for {k — 0.. kmax) Ck = Lkn/[{^_i)kmaxE]; 

^ = argmaXfe^i..fe_{Cfc}; 



fcn' 



for(i = 0..n) Bi^J2p=oLpiRk^p^i/L 
for(p = 0..A;) tp = argmax/,{Lp/,i?^_p ,^}, 



for(p = 
for(i = 



l..k) p^-- 



■-A\ 



:0..n 



for(j 

-^m—l.i 



ip — lip ip—l t' 



1,2) 



L ^ ij Z^m=l 

/x([ = 0; (r = l,2) 



z + l..n) 



--Q..n-l) 



iorjt- 

[ return (i;,C[],fc,Su,£[],/^,/z[j^); 
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Figure 1: [GL: low Gaussian noise] data 
(blue), PGR (black), BP (red), and 
variance^/^ (green). 



Figure 2: [GM: medium Gaussian noise] 
data (blue), PGR (black), BP (red), and 
variance^/^ (green). 
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Figure 3: [GM: medium Gaussian noise] Figure 4: [GH: high Gaussian noise] data. 

data with Bayesian regression ± 1 std.- 

deviation. 
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Figure 5: [GH: high Gaussian noise] 
data (blue), PGR (black), BP (red), and 
variance^/^ (green). 



Figure 6: [GH: high Gaussian noise] 
data with Bayesian regression ± 1 std.- 
deviation. 



jump at the first and a small jump at the second boundary. For n we chose 100, 
i-e. /i../25 = -l, /26--/50 = +l, and /5i--/ioo = 0. Data yt was obtained by adding 
independent Gaussian/Gauchy noise of same scale a for all t. We considered low 
a = 0.1, medium a = 0.32, and high a = l noise, resulting in an easy, medium, and 
hard regression problem (Figures 1-14). We applied our regression algorithm to these 
6 data sets (named GL,GM,GH,GL,GM,GH), where we modeled noise and prior as 
Gaussian or Gauchy with hyper-parameters also estimated by the Algorithms in 
Table 1. Table 2 contains these and other scalar summaries, like the evidence, 
likelihood, MAP segment number k and their probability. 

Three segment Gaussian with low noise. Regression for low Gaussian noise 
(cr = 0.1) is very easy. Figure 1 shows the data points (l,?/i),..,(100,yioo) together with 
the estimated segment boundaries and levels, i.e. the Piecewise Gonstant Regression 
(PGR) curve (black). The red curve (with the two spikes) is the posterior probability 
that a boundary (break point BP) is at t. It is defined as Bt := J2p=i^pt- 
Bayesian regressor (BPGR) is virtually sure that the boundaries are at ti = 25 (-825 = 
100%) andt2 = 50 (^25 = 99.9994%). The segment levels /ii = -0.98^-1, /i2 = 0.97^ 
1, /t3 = 0.01 ~ are determined with high accuracy i.e. with low deviation (green 
curve) a/y/25 = 2% for the first two and (t/\/50~1-4% for the last segment. The 
Bayesian regression (BR) curve jit is identical to PGR. 

Three segment Gaussian with medium noise. Little changes for medium Gaus- 
sian noise (a = 0.32). Figure 2 shows that the number and location of boundaries 
is still correctly determined, but the posterior probability of the second boundary 
location (red curve) starts to get a little broader (,850 = 87%). The regression curve 
in Figure 3 is still essentially piecewise constant. At t = 50 there is a small kink and 
the error band gets a little wider, as can better be seen in the (kink of the) green 
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-^Var[/x^|..] curve in Figure 2. In Figure 13 we study the sensitivity of our regression 
to the noise estimate a. Keeping everything else fixed, we varied a from 0.1 to 1 
and plotted the log-evidence logP(?/|(7) and the segment number estimate hio) as 
a function of a. We see that our estimate o"^0.35 is close to the hyper-ML value 
(THML = a'rgmaXcrP(?/|o-) ~0.33, which itself is close to the true (T = 0.32. The number 
of segments h is correctly recovered for a wide range of o around a. If cr is chosen 
too small (below the critical value 0.2), BPCR cannot regard typical deviations from 
the segment level as noise anymore and has to break segments into smaller pieces 
for a better fit (A; increases). For higher noise, the critical value gets closer to g. but 
also the estimate becomes (even) better. For lower noise, a overestimates the true 
(7, but BPCR is at the same time even less sensitive to it. 

Three segment Gaussian with high noise. Figure 4 shows the data with Gaus- 
sian noise of the same order as the jump of levels ((T= 1). One can imagine some 
up-trend in the first quarter, but one can hardly see any segments. Nevertheless, 
BPCR still finds the correct boundary number and location of the first boundary 
(Figure 5). The second boundary is one off to the left, since yso was accidentally 
close to zero, hence got assigned to the last segment. The (red) boundary probabil- 
ity curve is significantly blurred, in particular at the smaller second jump with quite 
small ^49 = 12% and ^50 = 10%. The levels themselves are within expected accuracy 
(t/-\/25 = 20% and (t/-\/50~ 14%, respectively, yielding still a PCR close to the true 
function. The Bayesian regression (and error) curve (Figure 6), though, changed 
shape completely. It resembles more a local data smoothing, following trends in the 
data (more on this in the next section). The variance (green curve in Figure 5) has 
a visible bump at i = 25, but only a broad slight elevation around t = 50. 

Three segment Cauchy. The qualitative results for the Cauchy with low noise 
((7 = 0.1) are the same as for Gauss, perfect recovery of the underlying function, and 
is hence not shown. Worth mentioning is that the estimate a based on quartiles is 
excellent(ly close to hyper-ML) even for this low noise (and of course higher noise), 
i.e. is very robust against the segment boundaries. 

Also for medium Cauchy noise ((7 = 0.32, Figure 8) our BPCR does not get fooled 
(even) by (clusters of) "outliers" at i = 16, i = 48,49, and t = 86,89,90. The second 
boundary is one off to the right, since is shghtly too large. Break probabihty Bi 
(red) and variance Var[/iJ|y,A';] (green) are nicely peaked at ti = 25 and ti — hX. 

For high Cauchy noise (0" = 1, Figure 9) it is nearly impossible to see any segment 
(levels) at all. Amazingly, BPCR still recovers three segments (Figure 10), but 
the first boundary is significantly displaced {t\ = 14). and Var[/ij|t/,A;] contain 
many peaks indicating that BPCR was quite unsure where to break. The Bayesian 
regression in Figure 11 identifies an upward trend in the data 1/14:35, explaining the 
difficulty/impossibility of recovering the correct location of the first boundary. 

Cauchy analyzed with Gauss and vice versa. In order to test the robustness of 
BPCR under misspecification, we analyzed the data with Cauchy noise by Gaussian 
BPCR (and vice versa). Gaussian BPCR perfectly recovers the segments for low 
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Figure 7: Posterior segment number 
probability P{k\y) for medium Gaussian 
noise (GM, black), high Cauchy noise 
(CH, blue), medium Cauchy noise with 
Gaussian regression (CMwG, green), 
aberrant gene copy 7^ of chromosome 1 
(Gen(3,l), red), normal gene copy # of 
chromosome 9 (Gen(5,9), pink). 



Figure 8: [CM: medium Cauchy noise] 
data (blue), PGR (black), BP (red), and 
variance"*^/^ (green). 
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Figure 9: [CH: high Cauchy noise] data. 



Figure 10: [CH: high Cauchy noise] 
data (blue), PGR (black), BP (red), and 
variance^/^ (green). 
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Figure 11: [CH: high Cauchy noise] 
data with Bayesian regression ± 1 std.- 
deviation. 



Figure 12: [CMwG: medium Cauchy 
noise] data (blue), but with Gaussian 
PGR (black), BP (red), and variance^/^ 
(green) . 
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Figure 13: [GM: medium Gaussian 
noise] logP{y) (blue) and k (green) as 
function of cr and our estimate a of 
(arg)maXo-P(?/) and k{a) (black trian- 
gles). 
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Figure 14: [GMwG: medium Gauchy 
noise] with Gaussian regression, \ogP{y) 
(blue) and k (green) as function of a and 
our estimate a of {axg)max„P{y) and 
k{a) (black triangles). 
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Figure 15: [Gen31: Aberrant gene 

copy # of chromosome 1] data (blue), 

PGR (black), BP (red), and variance^/^ 
(green) . 
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Figure 16: [Gen31: Aberrant gene copy 
# of chromosome 1] data with Bayesian 
regression ± 1 std. -deviation. 
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Cauchy noise. For medium noise (GMwG, Figure 12) the outlier at t = 49 is not 
tolerated and placed in it own segment, and the last segment is broken in two 
halves, but overall the distortion is less than possibly expected (e.g. not all outliers 
are in own segments). The reason for this robustness can be attributed to the way 
we estimate a. Figure 14 shows that the outliers have increased a far beyond the 
peak of P{y\a), which in turn leads to lower (more reasonable) number of segments. 
This is a nice stabilizing property of a. The other way round, segmentation of 
data with medium Gaussian noise is essentially insensitive to whether performed 
with Gaussian BPGR (Fig. 2 and 3) or Gauchy BPGR (GMwG, not shown), which 
confirms (once again) the robustness of the Gauchy model. But for high noise BPGR 
fails in both misspecification directions. 

10 Real- World Example &6 More Discussion 

Gene copy number data. All chromosomes (except for the sex chromosomes in 
males) in a healthy human cell come in pairs, but pieces or entire chromosomes can 
be lost or multiplied in tumor cells. With modern micro-arrays one can measure the 
local copy number along a chromosome. It is important to determine the breaks, 
where copy-number changes. The measurements are very noisy [Pin98]. Hence this 
is a natural application for piecewise constant regression of noisy (one-dimensional) 
data. An analysis with BPGR of chromosomal aberrations of real tumor samples, its 
biological interpretation, and comparison to other methods will be given elsewhere 
[KH06]. Here, we only show the regression results of one aberrant and one healthy 
chromosome (without biological interpretation). 
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Figure 17: [Gen31: Aberrant gene copy 
# of chromosome 1] logP{y) (blue) and k 
(green) as function of a and our estimate 
a of (arg)maXo-P(t/) and k{a) (black tri- 
angles) . 



Figure 18: [Gen59: normal gene copy # 
of chromosome 9] with Bayesian regres- 
sion. 



The "log-ratios" of a normal cell (and also the A of any cell) are very close to 
Gaussian distributed, so we chose Gaussian BPCR. The log-ratios y of chromosome 
1 of a sample known to have multiple myeloma are shown in Figure 15, together 
with the regression results. Visually, the segmentation is very reasonable. Long 
segments (e.g. t = 89.. 408) as well as very short ones around t = 87 and 641 of 
length 3 are detected. The Bayesian regression curve in Figure 16 also behaves 
nicely. It is very flat i.e. smoothes the data in long and clear segments, wiggles 
in less clear segments, and has jumps at the segment boundaries. Compare this 
to local smoothing techniques [Rin05], which wiggle much more within a segment 
and severely smooth boundaries. In this sense our Bayesian regression curve is 
somewhere in-between local smoothing and hard segmentation. We also see that 
the regression curve has a broad dip around t = 535. .565, although t = 510. .599 has 
been assigned to a single segment. This shows that other contributions breaking the 
segment have been mixed into the Bayesian regression curve. The PGR favor for 
a single segment is close to "tip over" as can be seen from the spikes in the break 
probability (red curve) in this segment. 

The dependence of evidence and segment number on a is shown in Figure 17. Our 
estimate a (black triangle) perfectly maximizes P{y\a) (blue curve). It is at a deep 
slope of P{k\y,a) (green curve), which means that the segmentation is sensitive 
to a good estimate of a. There is no unique (statistically) correct segmentation 
(number). Various segmentations within some range are supported by comparable 
evidence. 

Figure 18 shows a healthy chromosome 9, correctly lumped into one big segment. 
Posterior probability of the number of segments P{k\y). One of the most 
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Table 2: Regression summary 



Gauss, Cauchy, 
Low, Medium, 
High noise. Gene 


true noise 
scale 


data size 


method 


global mean 
estimate 


global deviation 
estimate 


in-segment 
deviation est. 


log-evidence 
logP(y) 


t3 

o 
o 

bjO s 


1 


a 

bi} 
CD 

ft 

o 


2 

CD ^ — ' 
CD -1- 

CD ^ 
X) 1 
CJ=I 


Name 


a 


n 


P 


v 


P 


a 


\ogE 


Z/-E 


K 




GL 


0.10 


100 


G 


-0.01 


0.69 


0.18 


39 


4.9 




3|3 


74%(0|20) 


GM 


0.32 


100 


G 


-0.03 


0.73 


0.35 


-48 


1.2 




3 3 


44% (0 29) 


GH 


LOO 


100 


G 


-0.10 


1.15 


1.03 


-156 


0.3 




3 4 


13%(10|12) 


CL 


0.10 


100 


C 


-0.02 


0.58 


0.09 


-17 


1.0 




3|3 


69%(0|21) 


CM 


0.32 


100 


C 


-0.09 


0.70 


0.27 


-127 


0.8 




3 3 


38% (0 27) 


CH 


1.00 


100 


c 


-0.20 


0.99 


0.86 


-234 


0.9 




34 


12%(11|11) 


GMwC 


0.32 


100 


c 


0.00 


0.49 


0.17 


-70 


1.5 




3|3 


27%(0|26) 


CMwG 


0.32 


100 


G 


0.01 


1.24 


1.22 


-160 


2.9 




5 8 


8%(8|8) 


Gen31 




769 


G 


0.55 


0.45 


0.30 


-283 


-1.5 


15|34 


6%(6|6) 


Gen59 




483 


G 


1.05 


0.47 


0.44 


-336 


-2.3 


11 


8%(0|6) 



critical steps for good segmentation is determining the right segment number, which 
we did by maximizing P{k\y). The whole curves shown in Figure 7 give additional 
insight. A representative selection is presented. 

For truly piecewise constant functions with ko<^n segments and low to medium 
noise, \ogP{k\y) typically raises rapidly with k till ko and thereafter decays approxi- 
mately linear (black curve) . This shows that BPCR certainly does not underestimate 
ko (P(A;< A;o|2/)~0). Although it also does not overestimate ko, only P{k>ko\y)^l, 
but P{ko\y) ^ 1 due to the following reason: If a segment is broken into two (or 
more) and assigned (approximately) equal levels, the curve and hence the likelihood 
does not change. BPCR does not explicitly penalize this, only implicitly by the 
Bayesian averaging (Bayes factor phenomenon [Goo83, JayOS, MacOS]). This gives 
very roughly an additive term in the log-likehhood of |logn for each additional de- 
gree of freedom (segment level and boundary). This observation is the core of the 
Bayesian Information Criterion (BIC) [Sch78, KW95, Wea99]. 

With increasing noise, the acute maximum become more round (blue curve), i.e. 
as expected, BPCR becomes less sure about the correct number of segments. This 
uncertainty gets pronounced under misspecification (green curve), and in particular 
when the true number of segments is far from clear (or nonexistent) like in the 
genome abberation example (red curve). The pink curve shows that logP{k\y) is 
not necessarily unimodal. 

Miscellaneous. Table 2 summarizes the most important quantities of the consid- 
ered examples. 



23 



While using the variance of A as estimate for a tends to overestimate a for low 
noise, the quartile method does not suffer from this (non)problem. 

The usefulness of quoting the evidence cannot be overestimated. While the ab- 
solute number itself is hard to comprehend, comparisons (based on this absolute(!) 
number) are invaluable. Consider, for instance, the three segment medium Gaus- 
sian noise data |/gm from Figure 2. Table 2 shows that log£'(GM) = —48, while 
log£'(GMwC) = —70, i.e. the odds that tjQyi has Cauchy rather than Gaussian noise 
is tiny e^^~™ < 10"^, and similarly the odds that t/cM has Gaussian rather than 
Cauchy noise is e^^r-ieo iq-ia^ r^j^j^ 

can be used to decide on the model to use. 
For instance it clearly indicates that noise in Gcnc31 and Gen59 is not Cauchy for 
which log-evidences would be —398 and —406, respectively. The smallness of the 
relative log-likelihoods does not indicate any gross misspecification. 

The indicated 4*'* segment for GH and CH is spurious, since it has length zero 
(two breaks at the same position). In GeneSl, only 15 out of the indicated 34 
segments are real. The spurious ones would be real had we estimated the breaks 
t jointly, rather than the marginals tp separately. They would often be single data 
segments at the current boundaries, since it costs only a single extra break to cut 
off an "outlier" at a boundary versus two breaks in the middle of a segment. 

In the last column we indicated the confidence Cj, {Cj,_^,C^_^-^) of BPCR in the 
estimate k. For clean data (GL,GM,CL,GM) it is certain that there are at least 
3 segments. We already explained the general tendency to also believe in higher 
number of segments. 

11 Extensions 8^ Outlook 

The core Regression ( A, n,A;max) algorithm does not care where the in-segment evi- 
dence matrix and moments A come from. This allows for plenty of easy extensions 
of the basic idea. 

If the segment levels are known to belong to a discrete set (e.g. integer DNA 
copy numbers [PRLD05]), this simply corresponds to a discrete prior on ji and leads 
naturally to a Grid sum (rather than by need) as in EstGcneral(). 

If each segment can have its own (unknown) variance cr^, we can assume some 
prior over Um and average (16) (which depends on notationally suppressed) 
additionally over cr^. Possibly P{am\---) depends on some hyper-parameter that 
now has to be estimated instead of a\ all the better if not. 

We assumed a constant regression function within a segment. Actually any other 
function could be used. We simply choose likelihood and prior for a single segment 
and compute its evidence . This is all what Regression () needs to determine the 
segment number and boundaries. Once we have the segment boundaries it is easy 
to compute the in-segment quantities we are interested in, e.g. the MAP or mean 
regression curve. 

For instance, if we consider all linear functions within a segment, we get a piece- 
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wise linear regression curve. But note that this curve is not continuous. This model 
is, for instance good, if the true function is essentially piecewise constant, but there 
is an additional underlying trend (slope) in the segments. Using non-linear functions 
allows to handle more complicated trends. 

Piecewise linear (or other) continuous regression is more complicated. Assume 
that fip in (12) does not denote the level of the whole segment p, but its level 
at the right boundary, which together with Hp-i determines the linear function in 
segment p. Only after fixing /ip, left and right side decouple. So the recursion 
analogous to (15) now involves a quantity Q which in addition to (ij) also depends 
on This functional recursion may approximately be solved by discretizing 

{{lJLi4JLrn) ^ -K^}; oi by approximating Q by a 2-dimcnsional Gaussian in {fii,fim) 
and storing only the 2 means and the 2x2 covariance matrix for each The 
following two simpler heuristic approaches may work sufficiently well in practice: 
One could ignore the continuity constraint when determining the boundaries, and 
only take them into account in the subsequent (much simpler) regression problem 
with known boundaries. Another possibility is to consider instead of the continuous 
piecewise linear function / its piecewise constant derivative /', i.e. use BPCR on 
and finally integrate the result. 

It is also not necessary to use a parametric model for the noise. If different 
segments can have different noise distributions, we could compute the in-segment 
evidence, mean, and variance A^j based on some (fast) non-parametric model. 
If all segments have the same distribution, we could non-parametrically estimate 
a single density for the differences A and then deconvolve the density (e.g. by 
FFT-^(yFFT(density)), and henceforth use this as prior for a in EstGeneral(). As 
non-parametric density estimator we could use the fast (linear-time) exact Bayesian 
tree model [Hut 05b]. 

Finally, for (very) large n, say > 1000, the 0{kmax^'^) algorithm is too slow. 
Fortunately, there is nearly no interaction between distant segments; boundary is 
often practically independent of where ifc±2, ^fc±3, etc. are placed. This suggests to 
break the whole data set into smaller overlapping pieces, where each piece should be 
long enough to contain at least four segments. Then boundaries t^'^'^'^ ^...^t^^!'^'^^ of each 
piece are used, and appropriately merged. For the Bayesian regression curve one 
should use some blending on the overlap. If single segments are very long, one could 
coarsen (locally lump together) the data and later refine around the boundaries. 

12 Summary 

We considered Bayesian regression of piecewise constant functions with unknown 
segment number, location and level. We derived an efficient algorithm that works for 
any noise and segment level prior, e.g. Cauchy which can handle outliers. We derived 
simple but good estimates for the in-segment variance. We also proposed a Bayesian 
regression curve as a better way of smoothing data without blurring boundaries. 
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The Bayesian approach also allowed us to straightforwardly determine the global 
evidence, break probabilities and error estimates, useful for model selection and 
significance and robustness studies. We discussed the performance on synthetic and 
real-world examples. Many possible extensions have been discussed. 

Acknowledgements. Thanks to lOSI for providing the gene copy # data and to 
Ivo Kwee for discussions. 
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